This document is divided in three sections:
1. SQL basics (data types)
2. SQL usage
3. Database design
Many data types exist in SQL
String variables
Numerical variables
date/time data types
DATE, TIME, TIMESTAMP,
INTERVAL
rental_date + INTERVAL ‘3 days’
arrays
SELECT field[1][1] FROM tableWHERE “text_to_search” = ANY(field_as_array)
WHERE field_as_array @> ARRAY[‘text_to_search’]Access the data types from the INFORMATION_SCHEMA table
SELECT column_name, data_type FROM INFORMATION_SCHEMA.COLUMNS WHERE column_name IN () AND table_name='xxx’;Changing (casting) a column type into another
CAST(value AS new_type) equivalent to
value::new_typeTIMESTAMP gives an
INTERVAL
AGE(TIMESTAMP, TIMESTAMP)
functiontimestamp “2016-05-01” + 21 * INTERVAL ‘1’ day
NOW() + ‘1 year 2 days 3 minutes’::intervalSELECT NOW() -> timestamp with timezone
SELECT NOW()::timestamp (remove the timezone)
SELECT CAST(NOW() as timestamp) is universalSELECT CURRENT_TIMESTAMP(2) now() rounded at 2 digits
SELECT CURRENT_DATE -> DateSELECT CURRENT_TIME -> Time with timezoneEXTRACT(field_from_source)
SELECT EXTRACT(month FROM NOW() AS month;
SELECT DATE_PART(‘quarter’, NOW()) AS quarter
SELECT DATE_TRUNC(‘month’, NOW())
to_char(date_created, 'day’) converts day of weeks to
Monday, Tuesday, etc.Concatenate strings:
SELECT field1 || ‘sep’ || field2 AS new_string
CONCAT(field1, sep, field2)
functionUPPER(field), LOWER(field),
INITCAP(field)
REPLACE(field, “str_to_change”, “new_str”)
REVERSE(field)
CHAR_LENGTH(field)
LENGTH() also workdsPOSITION(‘str' IN field)
LEFT(field, n), RIGHT(field,n)
SUBSTRING(field, 10, 50)
SUBSTR(email FROM 0 FOR POSITION(‘@‘ IN email)
SUBSTRING(email FROM POSITION(‘@‘ IN email)+1 FOR CHAR_LENGTH(email))TRIM(leading/trailing/both(default) ‘ ‘ from string)
SELECT TRIM(“ word “) -> “word”LTRIM() or RTRIM() or BTRIM()
[b for both] can also be used.TRIM(street, “ 0123456789#/.”)
LPAD(‘padded’, 10, ‘%’) ; RPAD()
SPLIT_PART(string, delimiter, part)
Full-text search
SELECT * FROM table LIMIT 3
SELECT 50+2, 51*2, 17/3
WHERE field LIKE ‘%valier’
WHERE field. This will return all the rows where the
field ‘field’ is not 0.
WHERE release_year <> 1960 means that we want all
but those entries with release_year = 1960WHERE release_year BETWEEN 1994 AND 2000
WHERE name LIKE ‘Ade%” vs
WHERE name LIKE “Ev_”, name NOT LIKE “A. %“
WHERE fav_fruit ILIKE “%apple%”
ILIKE is case-insensitive!!ORDER BY field DESC / ASC
SELECT “field name”USING (field) if the field has the same name in the two
tables;ON (table1.field = table2.field) AND (...)SELECT * FROM beverages CROSS JOIN food_itemsSELECT * FROM table1 INNER JOIN table2 USING (common_id)SELECT * FROM table1 LEFT JOIN table2 USING (common_id)SELECT * FROM table1 FULL OUTER JOIN table2 USING (common_id)SELECT * FROM table1 AS name1 LEFT JOIN table1 AS name2 ON (same1.fieldX = name2.fieldY)FROM t1 LEFT JOIN t2 ON t2.field >= t1.field1 AND t2.field < t1.field2SELECT * FROM t1 WHERE field IN (SELECT field FROM t2 WHERE)SELECT * FROM t1 WHERE field NOT IN (SELECT field FROM t2 WHERE)SELECT * FROM t1 UNION SELECT * FROM t2
UNION ALL exist and will create duplicates if need
beSELECT * FROM t1 INTERSECT SELECT * FROM t2SELECT * FROM t1 EXCEPT SELECT * FROM t2COALESCE(field, ‘new_value’)
STRING_AGG(field, sep=', ‘)
SELECT field AS new_nameROUND(field, 2) ; TRUNC(field, -2)
SELECT generate_series(1, 10, 2)
SELECT (gross - budget) FROM films; will return the
difference between th two fields for each entryDISTINCT opere sur un ensemble de lignesALTER TABLE film DROP COLUMN long_description;SELECT COUNT(*) FROM table—> Compte le nombre de
lignes
SELECT COUNT(field) FROM table —> Compte le nombre
de champ non-null dans field. Equivalent count(*) where field is not
nullCASE WHEN <clause> THEN <result> WHEN <clause> THEN <result> ... ELSE <result> ENDSUM, MIN, MAX,
AVG (en fonction du langage MySQL, PostgreSQL, sqlite, AVG,
AVERAGE, mEAN …), LENGTHVAR(field) calculates the sample variance (divided by
n-1).
var_pop(field) (divided by n)STDDEV the standard deviationCORR(field1, field2)
SELECT percentile_disc(percentile_desired) WITHIN GROUP (ORDER BY field)SELECT percentile_cont(percentile_desired) WITHIN GROUP (ORDER BY field)
* CTE for ‘Common table expression’ allows creating a temporary table
* `WITH aggregate_table as (SELECT * FROM ff GROUP BY xxx)`
`SELECT * FROM aggregata_Table WHERE ...`
* the 2nd, 3rd, etc CTE can refer to the previous CTEs
HAVING (a where applied after a group by).
GROUPING SETS: performe plusieurs GROUP BY à different
niveaux, puis fait une UNION des résultats
GROUP BY GROUPING SETS ((store_id, product_name), store_id, ())
GROUP BY ROLLUP (year, region, departement) HAVING year NOT NULL
GROUP BY CUBE (year, region, departement)
SELECT sum(field1) FILTER (WHERE field2 = ‘kn’)
EXISTS key word
SELECT * FROM movies AS m WHERE EXISTS (SELECT * FROM renting AS r WHERE rating is NOT NULL AND r.movie_id=m.movie_id);
NOT EXISTSOVER() applies an aggregating function over (part of)
the table
SELECT SUM(field1) OVER(ORDER by field2) —> Fait la
somme cumulée de champ 1 apres avoir triee la table sur field2AVG(daily_sales) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as avg_sales_3_daysAVG(daily_sales) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND 3 FOLLOWING) as avg_sales_3_daysSUM(daily_sales) OVER(ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulated_sum
MAX(field1) OVER(PARTITION BY field2) —> will apply
the function over the field by splitted by categories defined by field2
SELECT field1, LAG(field1, n) OVER() as field1_shifted_by_n
ROW_NUMBER() OVER()
RANK() OVER(ORDER BY AVG(field1+field2))
DENSE_RANK() OVER()
QUALIFY() allows to make a filtering n the results of
an aggregation (OVER())
LAG(var, n = lag_distance;default1)
LEAD(var, n)LEAD() and LAG() are relative function
since they fetch values relative to the current row.FIRST_VALUE()/LAST_VALUE(column) are absolute
function.NTILE(n) groups the data in n categories and it can be
easy to get statistics, for instance, for the 1st, 2nd, 3rd and 4th
quartiles.Note: GROUP BY reduit la taille de ma data frame alors que le OVER maintient la dimension initiale.
CREATE EXTENSION IF NOT EXISTS tablefunc;SELECT * FROM CROSSTAB( $$
request/table to crosstab
$$) AS ct (var1 VARCHAR, var2 VARCHAR ...)
ORDER BY var3CREATE VIEW view_name AS SELECT ... FROM ... WHERE ...;SELECT * FROM INFORMATION_SCHEMA.views;
WHERE table_schema NOT in (‘pg_catalog’, ‘information_schema’)DROP VIEW view_name [cascade | restrict]CREATE OR REPLACE VIEW view_name AS (new_query)CREATE MATERIALIZED VIEW my_view AS SELECT ...REFRESH MATERIALIZED VIEW my_view
CREATE TEMP TABLE tablename AS SELECT * FROM *
INSERT INTO new_table (SELECT * FROM *)UPDATE new_table SET colum = new_val WHERE condCREATE FUNCTION squared(i integer) RETURNS integer AS $$
BEGIN RETURN i * i; END;
$$ LANGUAGE plpgsqldifferent approaches to processing data OLTP vs. OLAP: * OLTP: Online Transaction Processing * Find the price of a book, update latest customer transaction * OLTP focus on supporting day-to-day operations * OLAP: Online Analytical Processing * Calculate the books with best profit margin, find most loyal customers * OLAP tasks are vaguer and focus on business decision making
ETL (Extract Tranform Load): The most traditional vs. ELT (Extract Load Transform): this approach is growing with big data
In the examples below, the first table example does NOT conform to the the normalization rule; the second table contains the same data and does conform to the normalization rule
CREATE TABLE tablename( fieldname1 INTEGER PRIMARY KEY, fieldname2 FLOAT NOT NULL, fieldname3 VARCHAR(150) NOT NULL )ALTER TABLE fact_booksales ADD CONSTRAINT sales_store FOREIGN KEY (store_id) REFERENCES dim_store_star (store_id);ALTER TABLE dim_author ADD COLUMN author_id SERIAL PRIMARY KEY;
Table partitionning: * Table grows too large and queries become slow
`CREATE TABLE sales (`
`...`
` timestamp DATE NOT NULL`
`)`
`PARTITION BY RANGE (timestamp)`
`CREATE TABLE sales_2019_q1 PARTITION OF sales FOR VALUES FROM (‘2019-01-01’) TO (‘2019-03-31’)`
`...`
`CREATE TABLE sales_2019_q4 PARTITION OF sales FOR VALUES FROM (‘2019-10-01’) TO (‘2019-12-31')`
`CREATE INDEX ON sales (‘timestamp’)`
CREATE ROLE data_analyst;CREATE ROLE intern WITH PASSWORD 'PasswordForIntern’ VALID UNTIL ‘2020-01-01'CREATE ROLE admin WITH CREATEDB LOGIN;ALTER ROLE admin CREATEROLE;GRANT UPDATE ON ratings to data_analyst;REVOKE UPDATE ON ratings to data_analyst;ALTER ROLE marta WITH PASSWORD 's3cur3p@ssw0rd';GRANT data_analyst TO alex;GRANT/REVOKE privileges ON viewname TO/FROM role
GRANT UPDATE ON ratings TO public